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A PROBLEM  IN  MULTIVARIATE  STATISTICS: 
ALGORITHM,  DATA  STRUCTURE  AND  APPLICATIONS 

Jon  Louis  Bsntley  and  Michael  Ian  Shanos 
Departments  of  Computer  Science  and  Mathematics 

April  1978 


Abstract 

We  investigate  probiems  and  applications  associated  with  computing 
the  empiricai  cumulative  distribution  function  of  N points  in  k- 
dlmenslonal  space  and  employ  a multidimensional  divide-and-conquer 
technique  that  gives  rise  to  a compact  data  structure  for  geometric 
and  statistical  search  problems.  We  are  able  to  show  how  to 
compute  a large  number  of  Important  statistical  quantities  much 
faster  than  was  previously  possible. 
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The  computational  complexity  of  statistical  procedures  lias  just  begun  to  be 
Investigated  [Gonzalez  77]  [Shamos  77]  but  proves  to  be  a rich  source  of  now 
theoretical  problems  and  algorithm  design  questions.  In  this  paper  we  conduct 
an  exhaustive  analysis  of  an  important  computational  problem  in  statistics 
Involving  a novel  data  structure  that  is  a direct  outgrowth  of  the  dlvide-and- 
conquer  algorithm  used  to  solve  the  problem.  We  establish  lower  bounds  on 
computation  time,  relate  the  problem  to  some  searching  and  combinatorial 
questions,  and  present  a variety  of  applications. 


A multivariate  statistical  sample  of  N observations  on  each  of  k variables  can 
be  regarded  conveniently  as  a set  of  N points  In  Euclidean  k-space.  We  say 
that  a point  X * (xi,...,X|()  dominates  point  Y,  written  X i Y,  Iff  X|  ^ y|  for  all  I. 
That  Is,  X dominates  Y iff  It  is  greater  than  or  equal  to  Y In  all  coordinates.  The 
dominance  relation  is  easily  soon  to  define  a partial  order  on  vectors.  We 
assume  for  simplicity  that  all  N points  are  cHstinct,  but  this  will  not  affect  the 
asymptotic  running  times  of  our  algorithms.  The  rank  r(Z)  of  a point  Z (not 
necessarily  a sample  point)  Is  the  number  of  sample  points  dominated  by  2.  The 
normalized  rank,  r(Z)/N,  which  is  the  fraction  of  points  dominated  by  Z,  is  better 
known  as  the  empirical  cumulative  distribution  function  (ECDF)  and  arises  In  a 
host  of  statistical  applications.  (The  use  of  the  word  "rank"  in  this  context  is 
•uggostlve  but  is  a slight  misnomer  because  the  points  are  not  linearly  ordered. 
A different  definition  of  rank  is  given  in  [Yoo  74]  but  is  unrelated  to  the  ECDF). 
We  now  distinguish  two  computational  problems: 


1.  (All  points  ECDF)  Given  N points  In  k-dimenslonal  space,  find  the  number  of 
points  dominated  by  each. 


1 
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2.  (ECDF  search)  Given  N points  in  k-dimensional  space,  with  preprocessing 
allowed,  find  r(Z)  for  a new  arbitrary  point  Z (without  adding  Z to  the  data 
structure). 

The  all-points  ECOF  problem  Is  the  crucial  step  In  computing  the  statistics 
associated  with  the  Hoeffding,  Kolmogorov-Smirnov  (K-S),  and  Cramor-Von  Mises 
tests  [Hollander  73]  [Hajek  07].  (These  applications  are  discussed  below.)  It 
includes  the  vector  maxima  problem  of  [Kung,  et  al.  75]  as  a special  case,  since 
a maximum  (In  their  parlance)  is  defined  as  a vector  that  Is  not  dominated  by 
any  other.  The  reflection  Z - -Z  transforms  a maximum  into  a vector  whose 
rank  is  zero,  so  the  ECOF  problem  can  be  used  to  find  maxima. 


In  econometrics,  it  is  common  to  represent  the  yield  of  a combination  of 
investments  as  a point  In  multidimensional  space,  and  one  is  interested  In 
strategies  that  are  not  dominated  by  any  others.  By  this  view,  one  step  in  the 
selection  of  an  investment  portfolio  is  an  EDCF  problem. 


The  ECDF  generalizes  the  notion  of  inversions  of  a permutation.  Consider  a 


two-dimensional  set  (X|,y|),  such  that  the  X|  are  In  Increasing  order.  Projecting 
the  points  on  the  y-axis  and  reading  them  in  increasing  y-order  induces  a 
permutation  Tri,...,7r|^  of  {1,...,N).  Point  i Is  dominated  by  point  J Iff  i<J  (that  la, 
X|<Xj)  and  trj<irj.  (See  Figure  1). 


Point  5 dominalat  point  2 
boesusa  2 pracodat  5 in 
tha  inducad  y-parmutalion 
246351  


Projacting  tha  numbarad  points 
onto  tha  y-axis  inducas  a 
par  mutation  of  1 
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Points  ara  numbarad  in 
ordar  of  incraaaing  x-value. 
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Figure  1 : The  connection  between  domination  and  inversions. 


The  number  of  points  dominated  by  (Xj.yp  Is  the  number  of  inversions  of  ir  In 
which  I participates.  This  shows  that  the  ECDF  Is  fundamentally  a discrete 
problem  and  is  based  only  on  the  ranks  of  the  coordinates,  not  on  their  actual 
values.  The  generalization  to  higher  dimensions  Is  now  clear.  Let  P^,...,P3  be  a 

collection  of  permutations  of  1 N.  Two  integers  i,J  with  Vi  l,i,  i N will  be  said 

to  form  a s-inverslon  Iff  I < J but  I follows  J in  each  permutation.  For  example, 
the  pair  (2,5)  is  a 4-inversion  in  the  following  set: 

453126  561243  631452  531462 

(Because  2 follows  5 In  each  permutation). 


Determining  the  number  of  k-inversions  In  which  each  integer  is  Involved  Is 
equivalent  to  a (k-f  1 )-climenslonal  ECOF  problem. 

Finding  the  number  of  elements  dominated  by  each  member  of  a general  partial 
order  must  take  O(N^)  time  in  the  worst  cosn  but  the  dominance  relation  we 
have  defined  is  a partial  order  of  a special  type  — It  is  induced  by  the 
lexicographic  product  of  k linearly  ordered  sets.  This  stiucture  will  lead  to  a 
fast  algorithm.  This  algebraic  view  leads  to  Interesting  combinatorial  questions, 
such  as  characterizing  those  posets  that  are  isomorphic  to  a set  of  points  in  k- 
space  under  the  dominance  ordering. 

The  ECOF  is,  In  a very  powerful  sense,  an  excellent  estimator  of  the 
underlying  population  CDF,  which  wo  often  wish  to  determine.  In  order  to  be  able 
to  use  this  function  we  must  be  able  to  compute  its  value  r(Z)  at  an  arbitrary 
point  Z.  A typical  application  Is  the  multivariate  Kolmogorov- Smirnov  procedure 
for  testing  the  hypothesis  that  two  samples  have  come  from  the  same  underlying 
distribution.  The  test  statistic  K is  equal  to  the  maximum  absolute  difference 
between  the  ECDFs  of  the  two  samples.  To  compute  K It  suffices  to  evaluate 
the  ECDF  of  sample  A at  each  of  the  points  in  sample  B,  and  vice-versa.  This 
entails  a search  for  each  point  of  B to  determine  how  many  points  in  A it 
dominates.  Below  we  discuss  a number  of  search  algorithms  that  Illustrate 
various  time-space  tradeoffs  and  concentrate  on  one  that  runs  in  O(log^N)  time 
(in  two  dimensions)  and  requires  only  0(N  log  N)  space  and  preprocessing  time. 


2.  ECDF  Algorithms 

It  is  easy  to  solve  the  all-points  ECOF  problem  in  O(kN^)  time  by  comparing 
each  of  the  N points  against  every  other  point  to  determine  how  many  It 
dominates.  While  In  a general  partial  order  of  N-vectors  this  would  also  be  a 
trivial  lower  bound  on  the  time  needed  to  compute  the  number  of  vectors 
dominated  by  each,  we  have  seen  that  the  dominance  oroerlng  is  of  special 
form.  In  the  present  case  we  will  use  this  structure  to  Improve  on  the  naive 
algorithm. 


We  employ  a multidimensional  divide-and-conquer  scheme  similar  to  the  one 
described  in  [Bentley  76a]  and  [Bentley  76b],  which  at  each  level  of  recursion 
reduces  the  dimension  of  the  problem  by  one  and  the  number  of  points  by  a 
factor  of  two. 


Thaormm  It  The  ail-points  ECOF  problem  can  be  solved  in  0(N  log'^'^N)  time  In 
the  worst  case. 


Algorithm: 


1 . Let  P be  a hyperplane,  normal  to  one  of  the  coordinate  axes,  that  divides 
the  collection  of  points  into  two  subsets  A and  B,  each  containing  N/2 


points.  Such  a plana  can  ba  found  in  0(N)  tima  by  choosing  P to  pass 
through  a point  that  has  madlan  first  coordinata. 

2.  Recursively  solve  the  all-points  problem  on  subsets  A and  B.  That  Is,  for 
each  point  In  A we  find  the  number  of  points  in  A that  It  dominates,  and 
similarly  for  B.  If  T(N,k)  Is  the  time  required  to  solve  the  entire  problem, 
then  the  solution  of  these  two  subproblems  can  be  accomplished  In  time 
2T(N/2,k). 

3.  Without  loss  of  generality,  let  A be  the  set  of  points  whose  first  coordinate 
does  not  exceed  that  of  any  point  of  B.  Note  that  the  ECOF  values 
obtained  for  set  A in  the  recursive  subproblem  solution  are  the  correct  final 
values,  since  P was  constructed  so  that  no  point  of  A can  dominate  any 
point  of  B.  It  remains  only  to  update  the  B values  to  reflect  the  number  of 
points  in  A that  are  dominated. 

4.  We  now  observe  that  each  point  of  B already  dominates  each  point  of  A in 
at  least  one  coordinate,  namely,  the  coordinate  whose  axis  is  normal  to  the 
dividing  plane  P.  This  coordinate  can  thus  be  removed  from  further 
consideration  in  forming  the  corrected  solution  for  B.  The  coordinate  can  ba 
eliminated  without  changing  any  dominance  relations  by  merely  projecting  all 
of  the  points  onto  P,  which  Is  a subspace  of  one  lower  dimension.  This 
projection  can  be  accomplished  in  0(N)  time  if  pointers  are  used  instead  of 
copying  lists  of  coordinates.  (Otherwise,  O(kN)  time  would  be  required.) 
The  projected  subproblem  can  be  solved  In  time  T(N,k-1).  Note  that  the 
"subproblem”  is  of  a somewhat  special  form,  as  we  are  only  interested  In 
learning  for  each  point  of  B the  number  of  points  in  A that  it  dominates. 

5.  Combining  the  subproblem  solutions  obtained  in  steps  2 and  4 can  be 
accomplished  In  0(N)  time,  so  the  recurrence  relation  for  T is 

T(N,k)  ■ 2T(N/2,k)  ♦ T(N,k-1 ) ♦ 0(N)  . 

■By  sorting  the  points  In  advance  on  each  coordinate  we  may  make  use  of 
the  fact  that  T(N,2)  ■ 0(N  log  N)  [Shamos  77]  to  obtain 

T(N,k)  ■ 0(N  log'^'^N)  ♦ 0(kN  log  N)  . 


In  unusual  circumstances,  the  number  of  dimensions  may  greatly  exceed  the 
number  of  sample  points,  N,  in  which  case  the  above  recursion  Is  inefficient 
because  Its  effort  Is  concentrated  on  reducing  the  number  of  points  in  the 
subproblems.  If  k > N^,  the  method  of  [Yao  74],  which  expilcitly  constructs  the 
matrix  of  the  partial  order,  can  be  used  to  compute  the  ECDF  in  0(kN^/log  N) 
time.  If  only  the  vector  maxima  are  desired,  a slight  modification  of  the  above 
algorithm  achieves  the  0(N  log^^'^N)  performance  attained  (for  k>2)  in  [Kung 
75].  It  has  been  shown  [Bentley  77a]  [Bentley  77b]  that  this  modified  maxima 
algorithm  runs  In  0(N)  expected  time  for  a very  wide  class  of  Input  distributions. 


I 

I 

f 

* 

I 

I 


3.  ECDF  Searching 

Once  the  all-points  problem  has  been  solved,  we  are  In  a position  to  arrange 
the  solution  into  a data  structure  that  will  make  it  easy  to  determine  the  number 
of  points  dominated  by  a new  point  X.  The  most  simple  algorithm  requires  no 
preprocessing  at  all  and  operates  by  comparing  X to  each  of  the  N k-dimensionai 
sample  points.  This  obvious  approach  requires  O(kN)  initialization  time,  O(kN) 
query  time,  and  O(kN)  storage.  It  is  somewhat  surprising  that  the  query  time  can 
be  significantly  reduced  with  no  asymptotic  increase  in  the  storage  used.  The 
method  of  k-d  trees  [Bentley  75]  achieves  OtkN^"^^*^)  search  time  after 
0(kN  log  N)  preprocessing,  but  still  needs  only  O(kN)  storage  [Lee  77]. 


It  is  possible  to  perform  ECDF  searching  extremely  rapidly  if  sufficient  storage 
and  preprocessing  time  are  available.  The  method  Is  based  on  the  fact  that 
there  exist  k-dimensional  rectangular  parallelepipeds  within  which  the  number  of 
dominated  points  remains  constant.  We  may  easily  see  why  this  Is  true. 
Consider  some  point  Z (not  in  the  original  set)  for  which  r(Z)  ^ 5 and  imagine 
moving  Z parallel  to  some  coordinate  axis.  (Refer  to  Figure  2.) 
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Figure  2:  The  ECDF  Is  constant  within  rectangular  regions. 


The  value  of  r(Z)  cannot  change  until  Z passes  the  projection  of  some  point  of 
the  original  set  on  that  axis.  This  Is  true  of  each  coordinate.  If  we  were  to 
construct  hyperplanes  normal  to  the  coordinate  axes  at  each  sample  point  (a 
total  of  Nk  hyperplanes),  these  would  divide  space  into  (N-fi)^  rectangular 
regions  within  each  of  which  the  function  r is  constant.  Such  a structure  can 
readily  be  queried  In  0(k  log  N)  time  by  a binary  search  along  each  coordinate, 
so  we  have 

Theorem  2.«  ECDF  searching  can  be  performed  in  0(k  log  N)  time,  with 
0(N*^  * kN  log  N)  storage  and  preprocessing  time. 
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The  storage  used  by  this  procedure  is  prohibitive.  We  propose  a date 
structure  and  search  algorithm  that  nearly  achieves  O(log  N)  search  time,  but 
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which  uses  less  than  quadratic  storage.  The  data  structure  Is  "Isomorphic”  to 
the  all-points  ECOF  algorithm  In  the  sense  that  It  is  a tree  structure  having  a 
branch  corresponding  to  each  recursive  call  in  the  algorithm.  The  storage 
required  by  the  search  procedure  and  the  time  used  by  the  all-points  algorithm 
are  described  by  exactly  the  same  recurrence  relation.  Furthermore,  the  data 
structure  for  searching  can  be  built  conveniently  during  the  solution  of  the  all- 
points  problem  at  no  asymptotic  increase  in  running  time. 

Let  us  first  treat  the  two-dimensional  case  (refer  to  Figure  3).  The 
dividing  line  L Is  the  two-dimensional  instance  of  the  hyperplane  P described  in 
the  all-points  algorithm,  and  is  used  to  define  the  first  test.  Let  A be  the  set  of 
N/2  points  to  the  left  of  line  L and  let  B be  the  set  of  points  to  the  right.  Given 
a new  point  Z we  want  to  determine  the  number  of  points  (In  both  A and  B)  that 
it  dominates.  In  a single  comparison  against  L we  can  determine  whether  Z lies 
in  A or  in  B.  If  Z lies  In  A (the  diagram  on  the  left  in  Figure  3)  it 
cannot  possibly  dominate  any  point  of  B,  so  we  may  confine  our  attention  to  a 
subproblem  of  half  the  size  of  the  original.  The  recurrence  describing  this 
situation  Is  Just 

T(N)  • T(N/2)  ♦ 1 . 
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Figure  3:  The  two  cases  of  ECDF  searching  in  the  plane. 


If  we  learn  from  the  first  comparison  that  Z lies  in  B then  the  problem  is  only 
slightly  more  complicated  (the  right  diagram  In  Figure  3).  We  must  find  the 
number  of  points  in  B that  are  dominated  by  Z,  which  can  be  done  In  time  T(N/2). 
We  then  add  to  that  the  number  of  points  in  A dominated  by  Z.  Since,  however, 
the  x-coordinate  of  Z is  known  to  be  greater  than  that  of  point  of  A,  this  number 
is  merely  the  number  of  points  of  A that  lie  below  Z.  If  we  project  the  points  of 
A onto  L and  sort  them  in  advance  (as  part  of  the  preprocessing)  we  will  be  able 
to  locate  Z in  this  ordering  in  O(log  N)  time  by  binary  search.  Thus  the 
recurrence  that  results  when  Z is  in  B is 

T(N)  ■ T(N/2)  ♦ Odog  N)  . 
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It  Is  immediate  that  T(N)  ■ O(log^N),  even  if  the  second  case  arises  after  each 
comparison. 

The  generalization  to  k dimensions  is  completely  straightforward.  The  iine  L is 
replaced  by  a hyperplane  and  the  sorted  list  by  a (k-D-dimensionai  ECOF 
search  structure.  The  search  time  is  given  by  the  recurrence 

T(N.k)  - T(N/2.k)  ♦ T(N.k-l),  T(N.1 ) = 0(iog  N). 

of  which  the  solution  Is  T(N,k)  « 0(log*^N).  The  storage  requirement  of  this 
algorithm  Is  easy  to  analyze  In  view  of  Its  recursive  structure.  In  two  dimensions 
we  need  to  store  two  data  structures  on  H/Z  points  and  one  linear  list  of  length 
N/2.  Thus,  S(N,2)  ~ 2S(N/2,2)  0(N)  > 0(N  log  N).  In  k dimensions,  we  have 

S(N,k)  » 2S(N/2,k)  ♦ S(N/2,k-1)  = 0(N  log*‘'^N)  . 

The  preprocessing  time  is  described  by  precisely  the  same  relation,  giving 

Theorem  3:  ECOF  searching  can  be  accomplished  in  O(log^N)  time,  using 
0(N  log^'^N)  storage  and  0(N  log^'^N)  time  for  preprocessing. 

4.  Applications 

We  now  will  present  some  new  appticotions  of  the  ECOF  algorithms  and 
elaborate  on  some  of  those  presented  in  the  introduction. 

4.1  Range  Queries 

I An  Inconvenient  but  common  type  of  geometric  search  problem  Is  the  range 

j query  [Knuth  T3].  Given  a set  of  N points  in  the  piane,  with  preprocessing 

allowed,  how  many  lie  In  the  rectangle  defined  by  a i x £ b and  c £ y £ d ? 
Within  the  framework  of  the  ECOF  problem,  range  searching  becomes 
elementary.  In  two  dimensions,  (see  Figure  4)  we  may  find  the 
number  of  points  lying  In  rectangle  ABCO  by  computing 

ECOF(A)  - [ ECOF(B)  ♦ ECOF(C)  ] ♦ ECOF(O)  . 

This  Is  a simple  application  of  the  combinatorial  principle  of  inclusion-exclusion. 
We  are  thus  able  to  respond  to  a range  query  by  evaluating  the  distribution 
function  at  four  points. 

In  k dimensions  we  need  to  evaluate  the  ECOF  at  2*^  points,  but  this  stiil  oniy 
requires  O(log'^N)  time  and  0(N  log^'^N)  storage  (or,  0(2*^log  N)  time  and  0(N*^) 
storage.)  The  range  query  example  provides  a link  between  the  empirical 
distribution  function  and  the  empirical  density  function.  The  fraction  of  a set 
contained  In  a piane  region  F is  a consistent  estimator  of  the  probability  content 
of  that  region  (the  probability  density  Integrated  over  F)  [Loftsgaarden  65]. 
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Figur«  4:  Rang*  searching  as  an  ECOF  problem. 


4.2  Kolmogorov-Smlrnov  Statistic 

Because  of  its  intimate  relation  to  the  ECOF,  we  point  out  an  anomaly  between 
the  K-S  one-  and  two-sample  tests.  The  K-S  one-sample  statistic  is  the 
maximum  devlotion  between  the  ECOF  of  a finite  point  set  and  a given 
hypotheticai  CDF  (which  we  assume  can  be  evaluated  at'  a single  point  in 
constant  time).  A linear-time  one-sample  K-S  algorithm  in  one  dimension  has 
been  given  by  [Gonzaiez  77];  it  makes  use  of  the  fact  that  any  COF  must  be  a 
monotonic  function.  While  the  situation  In  higher  dimensions  is  unclear,  the  ECOF 
algorithm  of  this  paper  can  be  used  to  compute  the  K-S  statistic  in  0(N  log^'^N) 
time.  The  K-S  two-sample  statistic  is  the  maximum  deviation  between  the 
ECDFs  of  two  given  finite  point  sets. 

Theorem  4.*  The  Kolmogorov  two-sample  statistic  must  take  0(N  log  N)  time  to 
compute.  In  the  worst  case. 

Proof:  The  K-S  two-sample  statistic  is  zero  iff  the  point  sets  are  identical. 
Set  equality  is  shown  to  require  0(N  log  N)  comparisons  in  [Reingold  72]. 
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Summary 


We  see  that  the  empirical  cumulative  distribution  function,  a ubiquitous 


quantity  in  statistical  analysis,  can  be  computed  quickly  at  the  given  sample 
points  and  can  be  evaluated  quickly  at  other  points.  The  data  structure  for 
ECOF  searching  was  arrived  at  directly  from  the  ECOF  algorithm  Itself.  The 
problems  we  have  considered  impinge  on  many  others  In  different  applications 
areas,  ail  of  which  may  be  solved  by  the  techniques  developed  here. 
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